Scalable Graph Building from Text Data
Authors
Abstract
In this paper, we propose NNCTPH, a new MapReduce algorithm that builds an approximate k-NN graph from large text datasets. The algorithm uses a modified version of Context Triggered Piecewise Hashing to bin the input data into buckets, then performs an exhaustive search inside each bucket to build the graph. It also uses multiple stages to join the otherwise unconnected subgraphs. We test the algorithm experimentally on several datasets consisting of the subjects of spam emails. Although the algorithm is still at an early stage of development, it is already four times faster than a MapReduce implementation of NN-Descent while producing a graph of the same quality.
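The bucket-then-search idea described in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' implementation: the bucketing function `toy_bucket_key` is a hypothetical placeholder for the paper's modified Context Triggered Piecewise Hashing, word-level Jaccard similarity is an assumed similarity measure, and both the MapReduce distribution and the multi-stage joining of subgraphs are left out.

```python
from collections import defaultdict
import heapq


def toy_bucket_key(text, n_words=2):
    """Hypothetical placeholder for the paper's modified CTPH.

    Keeps the first letter of each of the first n_words words, so that
    texts starting with similar words land in the same bucket.
    """
    words = text.lower().split()
    return "".join(w[0] for w in words[:n_words])


def jaccard(a, b):
    """Word-level Jaccard similarity (an assumed similarity measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def build_knn_graph(texts, k=2):
    """Bin texts into buckets, then search exhaustively inside each bucket."""
    buckets = defaultdict(list)
    for idx, text in enumerate(texts):
        buckets[toy_bucket_key(text)].append(idx)

    graph = {idx: [] for idx in range(len(texts))}
    for members in buckets.values():
        for i in members:
            # Compare item i only against the other members of its bucket.
            scored = [(jaccard(texts[i], texts[j]), j) for j in members if j != i]
            graph[i] = [j for _, j in heapq.nlargest(k, scored)]
    return graph


subjects = [
    "cheap watches online now",
    "cheap watches for sale",
    "win a free prize",
    "win a free phone",
    "meeting agenda",
]
graph = build_knn_graph(subjects, k=2)
```

Because comparisons happen only inside a bucket, an item that is alone in its bucket (here "meeting agenda") gets no neighbours, and each bucket forms its own disconnected subgraph. That is the gap the joining stages mentioned in the abstract are meant to close.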
Related resources
Automated Data Extraction from Scholarly Line Graphs
Line graphs are ubiquitous in scholarly papers. They are usually generated from a data table and are often used to compare the performance of various methods. The data in these figures cannot be accessed directly. Manual extraction of this data is hard and not scalable. On the other hand, automated systems for this data extraction task are not yet available. We report an analysis of line graphs to explain the ...
A Tandem Scalable Microwave-Assisted Williamson Alkyl Aryl Ether Synthesis under Mild Conditions
An efficient tandem synthesis of alkyl aryl ethers, including valuable building blocks of dialdehyde and dinitro groups, has been developed under microwave irradiation and solvent-free conditions on potassium carbonate as a mild solid base. A series of alkyl aryl ethers were obtained from alcohols in excellent yields by following the Williamson ether synthesis protocol under practical mild condi...
Scalable Corpus Annotation by Graph Construction and Label Propagation
The efficient annotation of documents in vast corpora calls for scalable methods of text classification. Representing the documents as graph vertices, rather than as vectors in a bag-of-words space, allows the necessary information to be pre-computed and stored. It also fundamentally changes the problem definition, from a content-based to a relation-based classificat...
Distributed NoSQL Storage for Extreme-Scale System Services
Today, with data accumulating rapidly, data-driven applications are emerging in science and commercial areas. On both HPC systems and clouds, the continuously widening performance gap between storage and computing resources prevents us from building scalable data-intensive systems. Distributed NoSQL storage systems are known for their ease of use and attractive performance and are increasingly u...
A Scalable Approach to Building a Parallel Corpus from the Web
Parallel text acquisition from the Web is an attractive way to augment statistical models (e.g., machine translation, cross-lingual document retrieval) with domain-representative data. The basis for obtaining such data is a collection of pairs of bilingual Web sites or pages. In this work, we propose a crawling strategy that locates bilingual Web sites by constraining the visitation policy o...